Athena: Mining-based Interactive Management of Text Databases

Authors

  • Rakesh Agrawal
  • Roberto Bayardo
  • Ramakrishnan Srikant
Abstract

We describe Athena: a system for creating, exploiting, and maintaining a hierarchical arrangement of textual documents through interactive mining-based operations. Requirements of any such system include speed and minimal end-user effort. Athena satisfies these requirements through linear-time classification and clustering engines which are applied interactively to speed the development of accurate models. Naive Bayes classifiers are recognized to be among the best for classifying text. We show that our specialization of the Naive Bayes classifier is considerably more accurate (7 to 29% absolute increase in accuracy) than a standard implementation. Our enhancements include using Lidstone's law of succession instead of Laplace's law, under-weighting long documents, and over-weighting author and subject. We also present a new interactive clustering algorithm, C-Evolve, for topic discovery. C-Evolve first finds highly accurate cluster digests (partial clusters), gets user feedback to merge and correct these digests, and then uses the classification algorithm to complete the partitioning of the data. By allowing this interactivity in the clustering process, C-Evolve achieves considerably higher clustering accuracy (10 to 20% absolute increase in our experiments) than the popular K-Means and agglomerative clustering methods.
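
As an illustration of the first enhancement named in the abstract, the following is a minimal sketch of a multinomial Naive Bayes text classifier that uses Lidstone's law of succession, which adds a constant lam to each word count (lam = 1 gives Laplace's law). It is not Athena's implementation: the function names, the default value lam = 0.1, and the toy data are assumptions made for illustration, and the document-length and author/subject weighting are not modelled here.

```python
import math
from collections import Counter, defaultdict


def train(docs, lam=0.1):
    """Collect counts from docs, a list of (tokens, label) pairs.

    lam is the Lidstone smoothing constant (hypothetical default);
    lam=1.0 reduces to Laplace's law of succession.
    """
    class_docs = Counter()             # number of documents per class
    word_counts = defaultdict(Counter)  # per-class word frequencies
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {
        "class_docs": class_docs,
        "word_counts": word_counts,
        "vocab": vocab,
        "lam": lam,
        "n_docs": len(docs),
    }


def word_prob(model, label, token):
    """Lidstone estimate: (count + lam) / (total + lam * |vocab|)."""
    counts = model["word_counts"][label]
    total = sum(counts.values())
    lam = model["lam"]
    return (counts[token] + lam) / (total + lam * len(model["vocab"]))


def classify(model, tokens):
    """Return the class with the highest log posterior score."""
    scores = {}
    for label, n in model["class_docs"].items():
        score = math.log(n / model["n_docs"])  # class prior
        for token in tokens:
            if token in model["vocab"]:  # skip words never seen in training
                score += math.log(word_prob(model, label, token))
        scores[label] = score
    return max(scores, key=scores.get)


# Toy training data, invented for this example.
docs = [
    (["stock", "market", "price"], "finance"),
    (["market", "trade"], "finance"),
    (["goal", "match", "team"], "sport"),
    (["team", "score"], "sport"),
]
model = train(docs, lam=0.1)
print(classify(model, ["market", "price"]))
```

A smaller lam shifts less probability mass to words unseen in a class than Laplace's lam = 1 does. Training and classification each make a single pass over the tokens, which is in keeping with the linear-time requirement stated in the abstract.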

Similar resources

Athena: Mining-Based Interactive Management of Text Database

We describe Athena: a system for creating, exploiting, and maintaining a hierarchy of textual documents through interactive mining-based operations. Requirements of any such system include speed and minimal end-user effort. Athena satisfies these requirements through linear-time classification and clustering engines which are applied interactively to speed the development of accurate models. Naive ...


Competitive Intelligence Text Mining: Words Speak

Competitive intelligence (CI) has become one of the major subjects for researchers in recent years. The present research is aimed to achieve a part of the CI by investigating the scientific articles on this field through text mining in three interrelated steps. In the first step, a total of 1143 articles released between 1987 and 2016 were selected by searching the phrase "competitive intellige...


Athena: Text Mining Based Discovery of Scientific Workflows in Disperse Repositories

Scientific workflows are abstractions used to model and execute in silico scientific experiments. They represent key resources for scientists and are enacted and managed by engines called Scientific Workflow Management Systems (SWfMS). Each SWfMS has a particular workflow language. This heterogeneity of languages and formats poses a complex scenario for scientists to search or discover workflo...


Argo: an integrative, interactive, text mining-based workbench supporting curation

Curation of biomedical literature is often supported by the automatic analysis of textual content that generally involves a sequence of individual processing components. Text mining (TM) has been used to enhance the process of manual biocuration, but has been focused on specific databases and tasks rather than an environment integrating TM tools into the curation pipeline, catering for a variet...


Interactive Predictive Analytics with Columnar Databases

Predictive Analytics is usually seen as a highly interactive task. Paradoxically, it is still performed mostly as a batch task. This not only limits its applicability, it also sets it apart from a task that is conceptually very close to it, namely OLAP analysis. The main reason for considering mining a batch task is the usually very high execution time on large data warehouses. While novel ...


Publication date: 1999